Based on recorded storm data from 1950 through 2011, this document provides some insight into the effects of severe weather on both the economy and the population of the United States of America. Such answers can help in planning responses to severe weather events and in preparing contingency plans.
We found that convection events (Lightning, Tornadoes, Thunderstorm Wind, Hail) are the most harmful to public health. We also found that flood events (Flash Floods, River Floods) are the most damaging to property and crops. Further, we looked at which states suffered the most damage to property and crops and which states suffered the most fatalities and injuries.
Load the libraries we will need:
library(ggplot2)
library(maps)
library(mapproj)
library(rCharts)
Our data is derived from the NOAA Storm Database.
Read in the data, mapping as many fields to numerical fields as possible. We are not converting the dates at this point, as we do not need them yet. More information about the data file is available from the National Weather Service Storm Data Documentation. The data is read in from a large CSV file containing more than 900,000 observations.
The initial set has 902,297 observations. We will later discard the observations we are not interested in, namely events that caused no fatalities, injuries or damage. First, let’s take a look at a summary of the initial data set:
summary(data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE
## Min. : 1.0 Length:902297 Length:902297 Length:902297
## 1st Qu.:19.0 Class :character Class :character Class :character
## Median :30.0 Mode :character Mode :character Mode :character
## Mean :31.2
## 3rd Qu.:45.0
## Max. :95.0
##
## COUNTY COUNTYNAME STATE EVTYPE
## Min. : 0.0 Length:902297 Length:902297 Length:902297
## 1st Qu.: 31.0 Class :character Class :character Class :character
## Median : 75.0 Mode :character Mode :character Mode :character
## Mean :100.6
## 3rd Qu.:131.0
## Max. :873.0
##
## BGN_RANGE BGN_AZI BGN_LOCATI
## Min. : 0.000 Length:902297 Length:902297
## 1st Qu.: 0.000 Class :character Class :character
## Median : 0.000 Mode :character Mode :character
## Mean : 1.484
## 3rd Qu.: 1.000
## Max. :3749.000
##
## END_DATE END_TIME COUNTY_END COUNTYENDN
## Length:902297 Length:902297 Min. :0 Length:902297
## Class :character Class :character 1st Qu.:0 Class :character
## Mode :character Mode :character Median :0 Mode :character
## Mean :0
## 3rd Qu.:0
## Max. :0
##
## END_RANGE END_AZI END_LOCATI
## Length:902297 Length:902297 Length:902297
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## LENGTH WIDTH F
## Min. : 0.0000 Min. : 0.000 Length:902297
## 1st Qu.: 0.0000 1st Qu.: 0.000 Class :character
## Median : 0.0000 Median : 0.000 Mode :character
## Mean : 0.2301 Mean : 7.503
## 3rd Qu.: 0.0000 3rd Qu.: 0.000
## Max. :2315.0000 Max. :4400.000
##
## MAG FATALITIES INJURIES
## Min. : 0.0 Min. : 0.0000 Min. : 0.0000
## 1st Qu.: 0.0 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median : 50.0 Median : 0.0000 Median : 0.0000
## Mean : 46.9 Mean : 0.0168 Mean : 0.1557
## 3rd Qu.: 75.0 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :22000.0 Max. :583.0000 Max. :1700.0000
##
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0.00 Length:902297 Min. : 0.000 Length:902297
## 1st Qu.: 0.00 Class :character 1st Qu.: 0.000 Class :character
## Median : 0.00 Mode :character Median : 0.000 Mode :character
## Mean : 12.06 Mean : 1.527
## 3rd Qu.: 0.50 3rd Qu.: 0.000
## Max. :5000.00 Max. :990.000
##
## WFO STATEOFFIC ZONENAMES LATITUDE
## Length:902297 Length:902297 Length:902297 Min. : 0
## Class :character Class :character Class :character 1st Qu.:2802
## Mode :character Mode :character Mode :character Median :3540
## Mean :2875
## 3rd Qu.:4019
## Max. :9706
## NA's :47
## LONGITUDE LATITUDE_E LONGITUDE_ REMARKS
## Min. :-14451 Min. : 0 Min. :-14455 Length:902297
## 1st Qu.: 7247 1st Qu.: 0 1st Qu.: 0 Class :character
## Median : 8707 Median : 0 Median : 0 Mode :character
## Mean : 6940 Mean :1452 Mean : 3509
## 3rd Qu.: 9605 3rd Qu.:3549 3rd Qu.: 8735
## Max. : 17124 Max. :9706 Max. :106220
## NA's :40
## REFNUM
## Min. : 1
## 1st Qu.:225575
## Median :451149
## Mean :451149
## 3rd Qu.:676723
## Max. :902297
##
We then filter out the unwanted events and keep only those events that caused fatalities, injuries, property damage or crop damage.
smallData <- data[data$FATALITIES > 0 | data$INJURIES > 0 | data$PROPDMG > 0 |
data$CROPDMG > 0, ]
This leaves us with 254,633 observations.
The EVTYPE field contains a large number of errors and inconsistencies. In order to report on the data, we add an additional column named category that contains the event category as used by the NCDC:
- convection
- extreme temperature
- flood
- marine
- tropical cyclone
- winter
- other
This is also the order of importance with which we will treat the various events. Convection events are the most important, so this order will also decide the tie-breaker if an event belongs to more than one category.
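A minimal sketch of how such a priority-ordered categorisation could be written is shown below. The keyword patterns and the helper name categorize_event are illustrative assumptions, not the patterns used to produce the results in this report.

```r
# Hypothetical keyword patterns, listed in priority order (first entry wins ties).
category_patterns <- c(
  "convection"          = "LIGHTNING|TORNADO|THUNDERSTORM|TSTM|HAIL",
  "extreme temperature" = "HEAT|COLD|FREEZE|FROST",
  "flood"               = "FLOOD|FLD",
  "marine"              = "SURF|TIDE|RIP CURRENT|MARINE",
  "tropical cyclone"    = "HURRICANE|TROPICAL|TYPHOON",
  "winter"              = "SNOW|ICE|BLIZZARD|WINTER"
)

categorize_event <- function(evtype) {
  evtype <- toupper(evtype)
  # Anything that matches no pattern stays in "other".
  category <- rep("other", length(evtype))
  # Walk from lowest to highest priority so that a higher-priority
  # category overwrites a lower-priority one when both match.
  for (name in rev(names(category_patterns))) {
    category[grepl(category_patterns[[name]], evtype)] <- name
  }
  category
}

categorize_event(c("TSTM WIND", "FLASH FLOOD", "Heavy Snow", "MARINE TSTM WIND", "DUST DEVIL"))
```

With a helper like this, the new column could be filled with smallData$category <- categorize_event(smallData$EVTYPE). An event such as MARINE TSTM WIND matches both the marine and the convection patterns and ends up in convection, which is the tie-breaker behaviour described above.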
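The PROPDMG and CROPDMG fields need some conversion before we can do math on them, because their magnitude is stored separately in PROPDMGEXP and CROPDMGEXP.
We map each exponent code to a numeric multiplier (columns propertydamageEXP and cropdamageEXP) and add two extra columns, propertydamage and cropdamage, that contain the property and crop damage in dollars.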
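The conversion could look roughly like the sketch below. The mapping of H, K, M and B to powers of ten, and the choice to treat every other code as a multiplier of 1, are assumptions made for illustration; the helper name exp_to_multiplier is hypothetical.

```r
# Assumed mapping from exponent code to multiplier; unknown or empty codes map to 1.
exp_to_multiplier <- function(code) {
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  multiplier <- unname(lookup[toupper(code)])
  multiplier[is.na(multiplier)] <- 1
  multiplier
}

# Toy rows standing in for the real data set.
example <- data.frame(PROPDMG = c(25, 2.5, 10, 3),
                      PROPDMGEXP = c("K", "M", "", "B"),
                      stringsAsFactors = FALSE)
example$propertydamageEXP <- exp_to_multiplier(example$PROPDMGEXP)
example$propertydamage <- example$PROPDMG * example$propertydamageEXP
example
```

The same helper would apply to CROPDMG and CROPDMGEXP to obtain cropdamageEXP and cropdamage.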
Having gone through the steps above, we now have a clean dataset containing all the information needed to perform exploratory data analysis. Let us now begin exploring what the dataset has to say.
We begin by exploring all the variables in the data set.
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
The figure above shows the frequency of the number of events occurring in the respective counties. The mean frequency of occurrence is around 22, so on average a county can be expected to have seen about 22 events between 1950 and 2011.
mean(out$count)
## [1] 22.20378
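The object out is not defined in the text. A plausible reconstruction, assuming it holds one row per county name with the number of filtered events for that name, is sketched below on toy data. The reported mean of 22.20378 is at least consistent with this reading, since it corresponds to roughly 11,468 groups over the 254,633 filtered observations.

```r
# Toy stand-in for smallData; only the COUNTYNAME column matters here.
toy <- data.frame(COUNTYNAME = c("WASHINGTON", "JEFFERSON", "WASHINGTON",
                                 "POLK", "WASHINGTON", "JEFFERSON"),
                  stringsAsFactors = FALSE)

# One row per county name with the number of recorded events (assumed layout of `out`).
counts <- table(toy$COUNTYNAME)
out <- data.frame(county = names(counts),
                  count = as.vector(counts),
                  stringsAsFactors = FALSE)

# Average number of events per county name.
mean(out$count)
```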
After adding the category variable, let’s take a look at a few statistics of the subsetted dataset:
summary(smallData)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE
## Min. : 1.00 Length:254633 Length:254633 Length:254633
## 1st Qu.:19.00 Class :character Class :character Class :character
## Median :29.00 Mode :character Mode :character Mode :character
## Mean :30.12
## 3rd Qu.:45.00
## Max. :95.00
##
## COUNTY COUNTYNAME STATE EVTYPE
## Min. : 0.00 Length:254633 Length:254633 Length:254633
## 1st Qu.: 31.00 Class :character Class :character Class :character
## Median : 77.00 Mode :character Mode :character Mode :character
## Mean : 96.26
## 3rd Qu.:129.00
## Max. :869.00
##
## BGN_RANGE BGN_AZI BGN_LOCATI
## Min. : 0.000 Length:254633 Length:254633
## 1st Qu.: 0.000 Class :character Class :character
## Median : 0.000 Mode :character Mode :character
## Mean : 1.207
## 3rd Qu.: 1.000
## Max. :177.000
##
## END_DATE END_TIME COUNTY_END COUNTYENDN
## Length:254633 Length:254633 Min. :0 Length:254633
## Class :character Class :character 1st Qu.:0 Class :character
## Mode :character Mode :character Median :0 Mode :character
## Mean :0
## 3rd Qu.:0
## Max. :0
##
## END_RANGE END_AZI END_LOCATI
## Length:254633 Length:254633 Length:254633
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## LENGTH WIDTH F
## Min. : 0.0000 Min. : 0.00 Length:254633
## 1st Qu.: 0.0000 1st Qu.: 0.00 Class :character
## Median : 0.0000 Median : 0.00 Mode :character
## Mean : 0.6651 Mean : 21.56
## 3rd Qu.: 0.0000 3rd Qu.: 0.00
## Max. :1845.0000 Max. :4400.00
##
## MAG FATALITIES INJURIES
## Min. : 0.00 Min. : 0.0000 Min. : 0.0000
## 1st Qu.: 0.00 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median : 0.00 Median : 0.0000 Median : 0.0000
## Mean : 31.43 Mean : 0.0595 Mean : 0.5519
## 3rd Qu.: 52.00 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :3430.00 Max. :583.0000 Max. :1700.0000
##
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0.00 Length:254633 Min. : 0.000 Length:254633
## 1st Qu.: 2.00 Class :character 1st Qu.: 0.000 Class :character
## Median : 5.00 Mode :character Median : 0.000 Mode :character
## Mean : 42.75 Mean : 5.411
## 3rd Qu.: 25.00 3rd Qu.: 0.000
## Max. :5000.00 Max. :990.000
##
## WFO STATEOFFIC ZONENAMES LATITUDE
## Length:254633 Length:254633 Length:254633 Min. : 0
## Class :character Class :character Class :character 1st Qu.: 0
## Mode :character Mode :character Mode :character Median :3440
## Mean :2738
## 3rd Qu.:4002
## Max. :7025
## NA's :4
## LONGITUDE LATITUDE_E LONGITUDE_ REMARKS
## Min. :-14451 Min. : 0 Min. :-14455 Length:254633
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0 Class :character
## Median : 8422 Median : 0 Median : 0 Mode :character
## Mean : 6545 Mean :1758 Mean : 4218
## 3rd Qu.: 9231 3rd Qu.:3641 3rd Qu.: 8835
## Max. : 17124 Max. :7025 Max. : 17124
## NA's :4
## REFNUM category propertydamageEXP
## Min. : 1 Length:254633 Min. :1.000e+00
## 1st Qu.:281406 Class :character 1st Qu.:1.000e+03
## Median :473485 Mode :character Median :1.000e+03
## Mean :484335 Mean :2.025e+05
## 3rd Qu.:703590 3rd Qu.:1.000e+03
## Max. :902260 Max. :1.000e+09
##
## propertydamage cropdamageEXP cropdamage
## Min. :0.000e+00 Min. :1.000e+00 Min. :0.000e+00
## 1st Qu.:2.000e+03 1st Qu.:1.000e+00 1st Qu.:0.000e+00
## Median :1.000e+04 Median :1.000e+00 Median :0.000e+00
## Mean :1.678e+06 Mean :3.568e+04 Mean :1.928e+05
## 3rd Qu.:3.500e+04 3rd Qu.:1.000e+03 3rd Qu.:0.000e+00
## Max. :1.150e+11 Max. :1.000e+09 Max. :5.000e+09
##
Let us explore a few univariate plots to find some trend in the data set.
# category is a discrete variable, so count the rows per category with geom_bar()
ggplot(smallData, aes(x=category)) + geom_bar()+
xlab("Categories of Events Responsible for most damage")
The figure above is a univariate plot showing the number of occurrences of the different event categories between 1950 and 2011. Convection events have occurred the most over these years, with floods in second place. Convection events comprise the following:
- Lightning
- Tornado
- Thunderstorm Wind
- Hail
It is also worth exploring the distribution of fatalities and injuries over the years:
ggplot(smallData, aes(x=FATALITIES)) + geom_histogram()+
xlab("Number of Fatalities")+xlim(0,10)
The fatalities plot shows that most events did not cause any fatalities. This is because we are looking at individual events at individual places; summed per year, fatalities and injuries would add up to much larger numbers. A plot of injuries would show the same pattern, so we do not include one. It is more informative to look at property damage.
ggplot(smallData, aes(x=PROPDMG)) + geom_histogram()+
xlab("Property Damage (PROPDMG, before applying PROPDMGEXP)")+xlim(0,30)
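Property damage is recorded far more often than harm to people, since even the mildest events can damage property without threatening the population. The mean of the raw PROPDMG field is 42.75, but this value is not in a single unit: it still has to be scaled by PROPDMGEXP. After conversion, the mean property damage per event is about 1.68 million USD (see the propertydamage column in the summary above).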
summary(smallData$PROPDMG)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 2.00 5.00 42.75 25.00 5000.00
Another important aspect of the economic damage caused by natural disasters is crop damage. Let’s take a look at what the graphs have to say:
summary(smallData$CROPDMG)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 5.411 0.000 990.000
ggplot(smallData, aes(x=CROPDMG)) + geom_histogram()+
xlab("Crop Damage (CROPDMG, before applying CROPDMGEXP)")+xlim(0,30)
Damage to agriculture and crops is smaller than damage to property.
ll<-data.frame(table(smallData$EVTYPE))
ll<-ll[order(-ll[,2]),]
ll[1:20,]
## Var1 Freq
## 418 TSTM WIND 63234
## 362 THUNDERSTORM WIND 43655
## 404 TORNADO 39944
## 132 HAIL 26130
## 72 FLASH FLOOD 20967
## 251 LIGHTNING 13293
## 379 THUNDERSTORM WINDS 12086
## 85 FLOOD 10175
## 192 HIGH WIND 5522
## 348 STRONG WIND 3370
## 476 WINTER STORM 1508
## 166 HEAVY SNOW 1342
## 156 HEAVY RAIN 1105
## 468 WILDFIRE 857
## 235 ICE STORM 708
## 453 URBAN/SML STREAM FLD 702
## 58 EXCESSIVE HEAT 698
## 200 HIGH WINDS 657
## 432 TSTM WIND/HAIL 441
## 413 TROPICAL STORM 416
The table above shows the 20 most frequent event types; TSTM WIND and THUNDERSTORM WIND are the two most frequent events in the USA. After exploring the most influential variables, let’s have a look at some bivariate plots to see what more the data has to tell us.
Let’s have a closer look at the regions where many events occurred between 1950 and 2011. I am subsetting the data so as to plot only those counties that have witnessed more than 1500 events.
The county name Washington has the highest number of weather events over these years. Note that the COUNTYNAME field does not distinguish between states, so this covers every county with that name. To investigate which categories of events occurred most often there, I subset the data to the Washington counties, keeping the events that caused at least one fatality.
washington2<-subset(smallData,
COUNTYNAME == "WASHINGTON" & FATALITIES > 0)
washington1<-table(washington2$category)
washington<-data.frame(events=names(washington1),
count = as.vector(washington1), stringsAsFactors = FALSE)
ggplot(washington, aes(x=events, y = count, fill = count))+
geom_bar(stat="identity")+xlab("Events")+
ylab("Number of occurrences")+
ggtitle("Most prominent Disaster in Washington")
Therefore, we can conclude that the most prominent disaster in Washington is convection: 32 of the 43 fatal events there are convection events and the remaining 11 are floods. The summary of the washington dataset gives us the following values.
summary(washington)
## events count
## Length:2 Min. :11.00
## Class :character 1st Qu.:16.25
## Mode :character Median :21.50
## Mean :21.50
## 3rd Qu.:26.75
## Max. :32.00
Breaking our category variable into individual event types, we get the following result:
table(washington2$EVTYPE)
##
## FLASH FLOOD FLOOD LIGHTNING TORNADO TSTM WIND
## 7 4 6 22 4
Aggregating the dataset over years, we can plot more histograms showing how the number of weather events has changed over time. One of my main reasons for doing this is to find out whether the number of events has grown considerably between 1950 and 2011.
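The yearly aggregation can be sketched as follows, on toy rows. The sketch assumes that BGN_DATE is stored as text in month/day/year order followed by a time; the toy values and the name yearly are illustrative.

```r
# Toy rows standing in for smallData.
toy <- data.frame(
  BGN_DATE = c("4/18/1950 0:00:00", "6/8/1951 0:00:00",
               "1/22/1951 0:00:00", "11/15/2011 0:00:00"),
  FATALITIES = c(0, 1, 2, 3),
  stringsAsFactors = FALSE
)

# Assumed date layout: month/day/year; the trailing time is ignored by as.Date().
toy$year <- as.integer(format(as.Date(toy$BGN_DATE, format = "%m/%d/%Y"), "%Y"))

# Count events and sum fatalities per year.
yearly <- aggregate(list(events = rep(1, nrow(toy)), fatalities = toy$FATALITIES),
                    by = list(year = toy$year), FUN = sum)
yearly
```

The resulting data frame has one row per year and can be passed to ggplot in the same way as the other aggregates in this report.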
As you can see, there has been a considerable increase in the number of events over the past few years. This suggests that global warming and other environmental degradation may have led to a rise in calamities over the years, although more complete record keeping in recent decades could produce a similar pattern.
From the line graph above, we can see more clearly how the number of events has grown. To see how fatalities and injuries have changed, let’s take a look at a few more graphs.
With so many more events each year, it is natural to expect the number of fatalities to increase as well. Let’s plot the fatalities over the years.
The number of fatalities rises at the same pace as the number of events; however, in recent times it has taken a dip, which raises the question of whether the government has put some kind of mitigation in place over the years. A closer look at INJURIES gives us more insight. If injuries follow the same trend, it would suggest that considerable effort has gone into educating the public about how to mitigate the destructive effects of the weather.
Another insight I wish to explore is how the events are distributed across the states. For this, I aggregated the data over the states to plot a few more results.
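A sketch of this state-level aggregation, using invented toy rows rather than the real data, might look like this. The column names events and fatalities and the object name stateData are assumptions.

```r
# Toy rows standing in for smallData; the values are made up for illustration.
toy <- data.frame(STATE = c("TX", "IL", "TX", "KY", "IL", "TX"),
                  FATALITIES = c(0, 5, 1, 0, 3, 0),
                  stringsAsFactors = FALSE)

# Count events and sum fatalities per state.
stateData <- aggregate(list(events = rep(1, nrow(toy)), fatalities = toy$FATALITIES),
                       by = list(state = toy$STATE), FUN = sum)

# Rank the states by number of events, most affected first.
ranked <- stateData[order(-stateData$events), ]
ranked
```

Ordering by a different column, for example -stateData$fatalities, gives the ranking by population damage instead.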
Let’s take a look at the state-wise event count. This tells us which state has been hit by the largest number of events between 1950 and 2011. It is important to know which states have been affected the most, as this forms a basis for raising awareness among the public about the events that occur in these states.
Clearly, the state of Texas has seen a lot of events over the years. However, to find out which state has been hit by the most fatal events, let’s take a look at the state-wise population damage. It would be reasonable to expect that Texas also suffered the most fatalities, but we cannot say so until we look at the graph.
The state of Illinois has been hit most severely by weather events, with many deaths as a result. This data could help the Illinois government support its residents during future events. The state of Texas is in second place, which shows that it has also been hit by some of the worst events.
Property damage, on the other hand, is consistent with the further analysis at the end of the report. We find that Kentucky is the worst hit in terms of property damage. The histogram of the crop damage that has occurred over the years is shown below:
Let’s look at a few scatter plots to identify whether there is any kind of relationship between a few features.
We begin by comparing two features, fatalities and injuries. We get the following result:
The graph shows a roughly linear relationship between the two variables, which makes sense. Let’s say a person is injured due to an unforeseeable calamity like a flood or lightning. If the event has a very serious effect on that person, he or she could succumb to the injury, resulting in a fatality.
Let’s look at the relationship between property damage and crop damage:
The x-axis depicts property damage, while the y-axis shows the corresponding crop damage. I have faceted according to the event categories to get more perspective on the relationship within each category.
For the most part we find that convection events cause a lot of property damage and crop damage. This is due to the fact that such events include tornadoes and thunderstorm winds, which have a disastrous effect on both. Hence we see a smooth linear regression line in the first panel.
Events like extreme temperatures have a larger effect on crops than on property, as shown in the second panel. The plot above thus gives us a perspective on what to expect in terms of the economic and health damage these weather events cause.
This section covers more detailed plots and combines multiple scatter plots to find relationships between variables quickly.
The figure above shows the latitude and longitude at which fatalities are concentrated. Because of the overplotting of blue dots, most of the data appears to show only a small number of fatalities. To get a better perspective, we will use the maps package to plot the fatalities on a map of the USA.
The figure above shows multiple deaths in the case of convection events and flood events.
The figure above shows the relationship between the various kinds of damage and fatalities. The aim is to find relationships between the economic and the health-related damage that weather events cause. Insights:
- Fatalities and crop damage have no relationship whatsoever, as expected.
- Injuries and crop damage once again have no relationship whatsoever.
- Injuries and fatalities seem to have an almost linear relationship with property damage.
- Property and crop damage seem to follow a direct relationship in some places.

On reviewing all the insights from the graphs and the exploratory data analysis, the following statements can be made:
- Washington suffers from a large number of convection events.
- Convection events cause the most damage in terms of health.
- Flooding events have the most effect on property and crops.
Calculate the total of all fatalities and injuries, so that we can find which events have the highest number of incidents. The new column is called incidents.
smallData$incidents = smallData$FATALITIES + smallData$INJURIES
We create a new set with the aggregate of the incidents grouped by the event types.
incidentData <- aggregate(list(incidents = smallData$incidents),
by = list(event = smallData$category),
FUN = sum, na.rm = TRUE)
Here is the overview of the event categories with the number of incidents:
incidentData$event <- reorder(incidentData$event, -incidentData$incidents)
ggplot(incidentData, aes(y = incidents)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
geom_bar(aes(x = event), data = incidentData, stat = "identity") +
ggtitle("Fatalities and Injuries") +
xlab("Event Category") + ylab("No. of Fatalities and Injuries")
Clearly the convection events (Lightning, Tornadoes, Thunderstorm Wind, Hail) have the greatest effect on injury and fatality.
Add a field with the total of all damage, both to property and to crops, so that we can find out which events cause the highest amount of damage. The new column is called damage and is expressed in billions of dollars.
smallData$damage = ((smallData$propertydamage + smallData$cropdamage)/1e+09)
We create a new set with the aggregate of the damage grouped by the event types.
damageData <- aggregate(list(damage = smallData$damage),
by = list(event = smallData$category),
FUN = sum, na.rm = TRUE)
Here is the overview of the event categories with the amount of damage:
damageData$event <- reorder(damageData$event, -damageData$damage)
ggplot(damageData, aes(y = damage)) +
theme(axis.text.x = element_text(angle = 90,
hjust = 1)) +
geom_bar(aes(x = event), data = damageData, stat = "identity") +
ggtitle("Property and crop damage") + xlab("Event Category") +
ylab("Amount of damage (billions of $)")
Clearly the flooding events (Flash Flood, River Flood) have the greatest effect on property damage and crop damage.
After converting the data into an aggregate form and replacing the state abbreviations with their names, I built a new CSV file containing the aggregate data and removed the states that are not present in the maps package.
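The abbreviation-to-name step could be sketched as follows. It relies on the state.abb and state.name vectors that ship with base R; the toy values, the column name region and the choice to drop Alaska and Hawaii (which the "state" map of the maps package does not include) are assumptions about how the step was carried out.

```r
# Toy aggregate; "PR" has no entry in state.abb and "HI" is not on the state map.
toy <- data.frame(state = c("TX", "IL", "PR", "HI"),
                  fatalities = c(10, 20, 1, 2),
                  stringsAsFactors = FALSE)

# Look up each abbreviation in the built-in state vectors.
idx <- match(toy$state, state.abb)

# Keep only rows with a known state and attach the lower-case full name.
mappable <- toy[!is.na(idx), ]
mappable$region <- tolower(state.name[idx[!is.na(idx)]])

# Drop the states that are not drawn on the contiguous-states map.
mappable <- mappable[!mappable$region %in% c("alaska", "hawaii"), ]
mappable
```

Lower-case names are used here because that is the form in which the map data names its regions, so the aggregate can be joined to the map polygons by region.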
The figure above shows the total damage to health, i.e. lives. As we can see, the most affected states are the western states, because they are victims of events like tornadoes and hurricanes. Let us now look at the damage to the economy, which includes property and crop damage.
As we can see, the western and central states of the USA have been hit hard, causing huge losses to the economy from 1950 to 2011.
Tropical storms and hurricanes are the most dangerous events and they cause the most destruction to property. These events come under convection events and hence, as seen in the trends above, convection events on average occur more often than other categories of events. Thus, relief measures for these kinds of events are a must to mitigate the damage done.
The question of which events damage crops the most brings to mind the most obvious answers, floods or droughts, as they cause the most damage to vegetation on average. The graph confirms this intuition: drought and flood are the major causes of damage to crops.